This report analyzes a sample of the claims data collected by an auto-insurance provider, Indian Money, Bangalore, India. The purpose of the analysis is to identify the factors affecting the profitability of the company. Insurance in India is a $17 billion industry, and it is imperative to target the right audience with the right deals to keep pace with the competition and sustain profits. The two factors that determine profit for an insurance company are premiums and claim amounts: Revenue = Premium – Claim Amount (keeping operational cost constant).
Based on the research question (how demographics such as age group, gender and region affect the profitability of the company) and the initial analysis of the dataset, three hypotheses were formulated.
The profitability of Indian Money is higher for female drivers in the North region. Explanation: We assumed that female drivers are more responsible than male drivers and that the North is the most affluent region, so we speculated that there would be a high correlation between female drivers in the North region and profit.
The profitability of Indian Money is higher for drivers above 40 years of age. Explanation: Experienced drivers drive more cautiously than young and inexperienced drivers.
The profitability of Indian Money is higher for vehicles with a cubic capacity of more than 2200. Explanation: Vehicles with a higher cubic capacity carry a higher premium, and their owners drive cautiously.
Revenue is calculated from premium and claim amount, and profit is calculated as a proportion of revenue. The dependent variable is profit. The independent variables considered are age group, gender, IDV (Insured Declared Value), year of manufacture and region (state, city), which can act as probable indicators of profitability for Indian Money.
Statistical measures and visualization techniques were used to analyze the variables at both the univariate and bivariate level. Transformations were tried, and outliers were removed to obtain a distribution closer to normal and a more reliable linear model. Where a categorical variable contained too few observations per category, multiple categories were combined into a new derived factor variable.
The dataset is a sample of data maintained by an auto-insurance provider named Indian Money, Bangalore, India. We received the data from a colleague working with this company. The data is collected by the organization during claims processing and year-end reporting of claim data. It contains 7702 observations and 15 variables.
## Policy.Number Year IDV City State Cubic.Capacity
## 1 3214946 2009 650119 bangalore karnataka 2200
## 2 3215858 2010 745688 kolkatta west bengal 1500
## 3 3216013 2007 236971 newdelhi ncr 1200
## 4 3216152 2011 791024 gurgaon haryana 1500
## 5 3216372 2009 259162 bangalore karnataka 1000
## Mfr.Model Premium Type Gender Channel Age Cover.Type
## 1 Tata Safari 17427 Renewal Male Broker 55-64 Comprehensive
## 2 Honda City 22305 From Comp Female Broker 18-24 Comprehensive
## 3 Maruti Swift 6573 From Comp Female Direct 25-34 Comprehensive
## 4 Honda City 23261 From Comp Male Direct 45-54 Comprehensive
## 5 Maruti Wagon-R 7382 From Comp Male Direct 55-64 Comprehensive
## PaymentFrequency ClaimsInd Claim.Amount Zone Vehicle_Cat Revenue
## 1 Annual 0 0 South CC-large sized 17427
## 2 Monthly 0 0 East CC-medium sized 22305
## 3 Monthly 0 0 North CC-small sized 6573
## 4 Annual 0 0 North CC-medium sized 23261
## 5 Monthly 1 80694 South CC-small sized -73312
The auto insurance industry in India is estimated to be worth $17 billion by 2025. The industry is growing rapidly and reached $15 billion in 2017. Companies compete with each other for customers' money while trying to reduce the claims filed by customers. Generating revenue and making a good profit is the key to success and to staying ahead of the competition. We received a sample of 7702 observations collected by the company during claims processing over 7 years (2005 to 2011). We have tried to identify the factors that affect the company's profitability.
We focused our efforts on analyzing the factors that impact the profitability of the company, and settled on the following hypotheses:
The profitability of Indian Money is higher for female drivers in the North zone.
The profitability of Indian Money is higher for drivers over 40 years of age.
The profitability of Indian Money is higher for vehicles with a cubic capacity of more than 2200.
| Name | Data Type | Variable | Description |
|---|---|---|---|
| Year | Factor | CV | 7 Years of Data |
| IDV | Numeric | IV | Insured Declared value of Car |
| Gender | Factor | IV | Male or Female |
| Zone | Factor | IV | Divided Regions of the Country |
| Age Group | Factor | IV | Age groups of Applicants |
| ClaimsInd | Factor | IV | Claims Taken (0-not taken,1-Taken) |
| Vehicle Category | Factor | IV | Clubbed to Cubic Capacity size |
| Revenue | Numeric | DV | Derived from Premium and Claim data (Premium - Claim) |
| Profit | Numeric | DV | Derived column (Revenue %): Revenue / Max(Revenue) * 100 |
Dataset has 15 variables and 7702 observations.
Dependent Variable: Premium and Claim Amount.
Independent variables: Age Group, IDV and Gender.
Derived variables: Zone, Vehicle Category, Revenue and Profit.
Independent Variables
IDV: This is the Insured Declared Value of the car, i.e. the valuation of the car according to the rules of the insurance company.
Age Group: The original data had various conflicting age groups, which were re-levelled. The final age groups are 18-24, 25-34, 35-44, 45-54, 55-64 and 65+.
Gender: Male and Female.
Dependent Variables
Premium: This is the total premium paid for the policy by a customer at the beginning of the year.
Claim Amount: This is the amount claimed by the customer.
Derived Variables
Zone:
• Derived from state variable.
• Divided into four zones – North, South, East, West.
• Condition:
o North:
o South:
o East:
o West:
• Reason: Very few observations in most states.
Vehicle Category:
• Derived from the Cubic Capacity variable.
• Divided into three vehicle categories: CC-large sized, CC-medium sized, CC-small sized.
• Condition:
o CC-large sized: cubic capacity >1800
o CC-medium sized: cubic capacity >1250 and <1800
o CC-small sized: cubic capacity <1250
• Reason: Very few observations in certain categories of cubic capacity.
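The binning above can be sketched with base R's `cut()`. This is a minimal illustration, not the original code: the five cubic-capacity values are taken from the sample rows printed earlier, and the handling of the exact boundary values 1250 and 1800 (which the conditions above leave unspecified) is an assumption.

```r
# Cubic capacities of the five sample policies shown in the data preview
cubic_capacity <- c(2200, 1500, 1200, 1500, 1000)

# Assumption: boundary values fall into the lower category,
# i.e. intervals are (-Inf, 1250], (1250, 1800], (1800, Inf)
vehicle_cat <- cut(
  cubic_capacity,
  breaks = c(-Inf, 1250, 1800, Inf),
  labels = c("CC-small sized", "CC-medium sized", "CC-large sized")
)

print(data.frame(cubic_capacity, vehicle_cat))
```

The resulting categories agree with the Vehicle_Cat column of the same five rows in the data preview.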
Revenue:
• Derived from premium and claim amount.
• Revenue = Premium – Claim Amount.
• Reason: Revenue generated per person.
Profit:
• Derived from revenue.
• Profit = Revenue/ Max(Revenue) * 100.
• Reason: Profit as a proportion of revenue.
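As a small sketch of the two derivations, the formulas can be applied to the premium and claim amounts of the five sample rows shown in the data preview. The profit formula is applied literally as stated; under that literal reading a policy with a large claim receives a negative value, whereas the summary statistics report a minimum of 0, so the original computation may have included a shift or rescaling step that is not shown here (this is an assumption, not something the report states).

```r
# Premium and claim amounts of the five sample policies in the data preview
premium      <- c(17427, 22305, 6573, 23261, 7382)
claim_amount <- c(0, 0, 0, 0, 80694)

# Revenue = Premium - Claim Amount
revenue <- premium - claim_amount

# Profit = Revenue / Max(Revenue) * 100, applied literally as stated
profit <- revenue / max(revenue) * 100

print(data.frame(premium, claim_amount, revenue, profit))
```

The revenue column reproduces the Revenue values printed for the same rows.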
PROFIT
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 7702 80.45 6.32 81.95 81.51 1.3 0 100 100 -3.14 18.23
## se
## X1 0.07
Analysis: The profit data is not normally distributed; it is negatively skewed (skew = -3.14). The boxplot shows many outliers.
IDV
## vars n mean sd median trimmed mad min max
## X1 1 7702 385617.3 246733.8 305093 343581.4 136637.9 111822 1790603
## range skew kurtosis se
## X1 1678781 2.08 5.63 2811.43
Analysis: IDV is right-skewed (skew = 2.08) because of a minority of vehicles with a high insured declared value. The boxplot shows many outliers.
ZONE
##
## East North South West
## 935 4163 1728 876
##
## East North South West
## 12.13970 54.05090 22.43573 11.37367
Analysis: Zone is a categorical variable derived from the State variable in the dataset. The North zone has the most observations (54%).
Age
##
## 18-24 25-34 35-44 45-54 55-64 65+
## 1145 1500 2089 1654 782 532
##
## 18-24 25-34 35-44 45-54 55-64 65+
## 14.866269 19.475461 27.122825 21.474942 10.153207 6.907297
Analysis: Age group is a categorical variable that groups drivers into 6 categories. The 35-44 age group has the most observations (27%).
Gender
##
## Female Male
## 2134 5568
##
## Female Male
## 27.70709 72.29291
Analysis: Gender is a categorical variable. The dataset has more male drivers (72%) than female drivers (28%).
Vehicle_cat
##
## CC-large sized CC-medium sized CC-small sized
## 1023 2571 4108
##
## CC-large sized CC-medium sized CC-small sized
## 13.28226 33.38094 53.33680
Analysis: Vehicle category is a variable derived from the cubic capacity variable in the dataset. It groups vehicles into 3 categories. Small-sized vehicles account for most of the observations (53%).
Year
##
## 2005 2006 2007 2008 2009 2010 2011
## 272 693 870 988 1154 1980 1745
##
## 2005 2006 2007 2008 2009 2010
## 0.03531550 0.08997663 0.11295767 0.12827837 0.14983121 0.25707608
## 2011
## 0.22656453
ClaimsInd
##
## 0 1
## 5825 1877
HANDLING OUTLIERS
We define a function, outliers, that returns the indices of extreme outliers, i.e. values more than 3 times the IQR above the upper quartile or below the lower quartile.
outliers <- function(column) {
  # quantile() returns min, Q1, median, Q3, max; pick Q1 and Q3
  lowerq <- as.vector(quantile(column)[2])  # 1st quartile
  upperq <- as.vector(quantile(column)[4])  # 3rd quartile
  iqr <- upperq - lowerq
  # Extreme outliers lie more than 3 * IQR beyond the quartiles
  extreme.outliers.upper <- upperq + (iqr * 3)
  extreme.outliers.lower <- lowerq - (iqr * 3)
  extreme.outliers <- which(column > extreme.outliers.upper |
                              column < extreme.outliers.lower)
  print(paste("Extreme outlier:", extreme.outliers))
  return(extreme.outliers)  # row indices of the extreme values
}
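To illustrate how the 3 × IQR rule behaves, here is a compact restatement of the same logic applied to a toy vector. The helper name `extreme_outlier_idx` and the toy data are hypothetical and not part of the original analysis.

```r
# Hypothetical compact version of the rule: indices of values lying
# more than 3 * IQR beyond the first or third quartile
extreme_outlier_idx <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  which(x > q[2] + 3 * iqr | x < q[1] - 3 * iqr)
}

# Toy data: ten ordinary values and one extreme value
x <- c(1:10, 100)
idx <- extreme_outlier_idx(x)
print(idx)         # position of the extreme value
x_clean <- x[-idx] # vector with the extreme value removed
```

Here Q1 = 3.5 and Q3 = 8.5, so the fences are -11.5 and 23.5 and only the value 100 is flagged. Note that `x[-idx]` should only be used when `idx` is non-empty, since a negative empty index would drop every element.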
## [1] 0.2067221
Analysis: The correlation between profit and IDV in the original data is weak (r ≈ 0.21). Our univariate analysis showed that IDV is highly skewed, so we identify and remove the extreme outliers in IDV.
The univariate analysis showed that profit is highly skewed too, so we also identify and remove the extreme outliers in profit.
## [1] 0.7021029
Percentage improvement in the correlation after handling outliers in both numeric variables:
## [1] 239.6361
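The improvement figure follows directly from the two correlation values printed above; a short check, using only those reported numbers:

```r
# Correlation between profit and IDV, as reported above
r_before <- 0.2067221  # original data
r_after  <- 0.7021029  # after removing extreme outliers

# Relative improvement in percent
improvement <- (r_after - r_before) / r_before * 100
print(round(improvement, 2))
```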
Comparing claims with gender, age, zone, revenue, IDV and vehicle category.
Claimed vs gender
Analysis: The plot shows that female drivers claim slightly more than male drivers.
Claimed vs age
Analysis: The plot shows that the 35-44 age group claims slightly more than the other age groups.
Claimed vs zone
Analysis: The plot shows that the North zone claims slightly more than the other zones.
Claimed vs vehicle category
Analysis: The plot shows that the number of claims made does not vary with vehicle category.
Profit among Gender
## ins_rm_ex_IDV_pro$Gender ins_rm_ex_IDV_pro$profit
## 1 Female 82.48491
## 2 Male 82.55055
## # A tibble: 2 x 3
## Gender avg std
## <fctr> <dbl> <dbl>
## 1 Female 82.48491 1.638886
## 2 Male 82.55055 1.599992
Analysis: The box plot shows little difference in profit by gender; the group means are nearly equal (82.48 vs 82.55). The ANOVA p-value is greater than 0.05, so at the 5% significance level we cannot reject the null hypothesis that there is no relation between profit and gender. Profit is therefore not significantly affected by gender.
Running an ANOVA test to verify the relation:
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 5 5.244 2.022 0.155
## Residuals 6233 16165 2.593
With a p-value of 0.155, we fail to reject the null hypothesis; there is no evidence of a relation between profit and gender.
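As a sanity check on how the ANOVA table is read, the F value and p-value can be recomputed from the mean squares and degrees of freedom it reports. This sketch uses only the printed (rounded) numbers, so small rounding differences are expected.

```r
# Values taken from the ANOVA table for Gender
ms_gender <- 5.244  # mean square for Gender (1 df)
ms_resid  <- 2.593  # residual mean square (6233 df)

# F statistic is the ratio of the two mean squares
f_value <- ms_gender / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df1 = 1, df2 = 6233, lower.tail = FALSE)

print(round(c(F = f_value, p = p_value), 3))
```

Because the p-value is above 0.05, the difference in mean profit between genders is not statistically significant at the 5% level.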
Profit among Age
## ins_rm_ex_IDV_pro$Age ins_rm_ex_IDV_pro$profit
## 1 18-24 82.56411
## 2 25-34 82.59072
## 3 35-44 82.46932
## 4 45-54 82.55855
## 5 55-64 82.55965
## 6 65+ 82.43581
## # A tibble: 6 x 3
## Age avg std
## <fctr> <dbl> <dbl>
## 1 18-24 82.56411 1.655323
## 2 25-34 82.59072 1.600675
## 3 35-44 82.46932 1.657083
## 4 45-54 82.55855 1.546252
## 5 55-64 82.55965 1.613356
## 6 65+ 82.43581 1.547866
Analysis: The box plot shows little difference in profit by age group; the group means are nearly equal (82.44 to 82.59). The ANOVA p-value is greater than 0.05, so at the 5% significance level we cannot reject the null hypothesis that there is no relation between profit and age group. Profit is therefore not significantly affected by age group.
Running an ANOVA test to verify the relation:
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 5 17 3.427 1.322 0.252
## Residuals 6229 16153 2.593
With a p-value of 0.252, we fail to reject the null hypothesis; there is no evidence of a relation between profit and age group.
Profit among Zone
## ins_rm_ex_IDV_pro$Zone ins_rm_ex_IDV_pro$profit
## 1 East 82.57871
## 2 North 82.51103
## 3 South 82.57034
## 4 West 82.51009
## # A tibble: 4 x 3
## Zone avg std
## <fctr> <dbl> <dbl>
## 1 East 82.57871 1.557689
## 2 North 82.51103 1.620769
## 3 South 82.57034 1.656531
## 4 West 82.51009 1.525994
Running an ANOVA test to verify the relation:
## Df Sum Sq Mean Sq F value Pr(>F)
## Zone 3 6 1.872 0.722 0.539
## Residuals 6231 16165 2.594
With a p-value of 0.539, we fail to reject the null hypothesis; there is no evidence of a relation between profit and zone.
Profit among Vehicle Category
## ins_rm_ex_IDV_pro$Vehicle_Cat ins_rm_ex_IDV_pro$profit
## 1 CC-large sized 84.35655
## 2 CC-medium sized 82.91999
## 3 CC-small sized 81.88915
## # A tibble: 3 x 3
## Vehicle_Cat avg std
## <fctr> <dbl> <dbl>
## 1 CC-large sized 84.35655 1.741308
## 2 CC-medium sized 82.91999 1.634476
## 3 CC-small sized 81.88915 1.093841
Analysis: Mean profit differs by vehicle category: large-sized vehicles have the highest mean profit (84.36), followed by medium-sized (82.92) and small-sized vehicles (81.89). The ANOVA p-value is far below 0.05, so we reject the null hypothesis that there is no relation between profit and vehicle category. The Tukey test shows that all three pairwise differences are significant.
Running an ANOVA test to verify the relation:
## Df Sum Sq Mean Sq F value Pr(>F)
## Vehicle_Cat 2 4272 2136.0 1119 <0.0000000000000002 ***
## Residuals 6232 11898 1.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = profit ~ Vehicle_Cat, data = ins_rm_ex_IDV_pro)
##
## $Vehicle_Cat
## diff lwr upr p adj
## CC-medium sized-CC-large sized -1.436563 -1.573586 -1.2995402 0
## CC-small sized-CC-large sized -2.467401 -2.596851 -2.3379506 0
## CC-small sized-CC-medium sized -1.030837 -1.121245 -0.9404296 0
There is a relation between profit and vehicle category, with large-sized vehicles being the most profitable for the company.
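The ANOVA-plus-Tukey workflow can be sketched as follows. The data here are simulated purely for illustration (group means and standard deviations loosely follow the summary table above, with an arbitrary 50 observations per group); the data frame `sim` is hypothetical and the results will not match the report's output.

```r
set.seed(42)

# Simulated stand-in for the real data: three vehicle categories
sim <- data.frame(
  Vehicle_Cat = factor(rep(
    c("CC-large sized", "CC-medium sized", "CC-small sized"), each = 50
  )),
  profit = c(rnorm(50, mean = 84.4, sd = 1.7),
             rnorm(50, mean = 82.9, sd = 1.6),
             rnorm(50, mean = 81.9, sd = 1.1))
)

# One-way ANOVA: does mean profit differ across categories?
fit <- aov(profit ~ Vehicle_Cat, data = sim)
print(summary(fit))

# Tukey HSD: which pairs of categories differ?
tuk <- TukeyHSD(fit, "Vehicle_Cat", conf.level = 0.95)
print(tuk)
```

Each row of the Tukey table is one pairwise comparison; an interval (`lwr`, `upr`) that excludes zero indicates a significant difference between that pair.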
Note:
• We removed the outliers (1279 + 188 observations), but we also ran the analysis on the original data for all factors.
• There is no major change in the trends, so we consider it safe to remove the outliers.
Trying a few transformations for a better distribution of profit:
Taking the log had no effect.
Taking the square root had no effect.
We continue the analysis with the untransformed variable.
Model #1
Creating a model using only the numeric variable, i.e. IDV
##
## Call:
## lm(formula = profit ~ IDV, data = ins_rm_ex_IDV_pro)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5617 -0.2535 -0.0843 0.2650 6.1736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.43711723966 0.03059346105 2629.23 <0.0000000000000002 ***
## IDV 0.00000577385 0.00000007417 77.84 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.147 on 6233 degrees of freedom
## Multiple R-squared: 0.4929, Adjusted R-squared: 0.4929
## F-statistic: 6060 on 1 and 6233 DF, p-value: < 0.00000000000000022
Model #2
Creating a model using all the variables
##
## Call:
## lm(formula = profit ~ IDV + Gender + Vehicle_Cat + Age + Zone +
## Year, data = ins_rm_ex_IDV_pro)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6810 -0.2087 0.0409 0.3057 6.0525
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 80.7162062784 0.1190821358 677.820
## IDV 0.0000065859 0.0000001234 53.353
## GenderMale 0.0842517237 0.0323590474 2.604
## Vehicle_CatCC-medium sized 0.0364740376 0.0548477614 0.665
## Vehicle_CatCC-small sized 0.1876735984 0.0670843519 2.798
## Age25-34 0.0262043764 0.0489872183 0.535
## Age35-44 -0.0768689473 0.0458444205 -1.677
## Age45-54 0.0329628583 0.0476694745 0.691
## Age55-64 -0.0425918737 0.0575798498 -0.740
## Age65+ -0.0267083154 0.0649921829 -0.411
## ZoneNorth -0.0909098648 0.0441521145 -2.059
## ZoneSouth -0.0342999740 0.0495941442 -0.692
## ZoneWest -0.1267021763 0.0579187750 -2.188
## Year2006 -0.0975678394 0.0872213229 -1.119
## Year2007 -0.3580491535 0.0848796624 -4.218
## Year2008 -0.6176543067 0.0847941500 -7.284
## Year2009 -0.8296783768 0.0841800387 -9.856
## Year2010 -0.8760355204 0.0826890728 -10.594
## Year2011 -0.8467044504 0.0848576902 -9.978
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## IDV < 0.0000000000000002 ***
## GenderMale 0.00925 **
## Vehicle_CatCC-medium sized 0.50607
## Vehicle_CatCC-small sized 0.00516 **
## Age25-34 0.59272
## Age35-44 0.09364 .
## Age45-54 0.48928
## Age55-64 0.45951
## Age65+ 0.68113
## ZoneNorth 0.03953 *
## ZoneSouth 0.48921
## ZoneWest 0.02874 *
## Year2006 0.26334
## Year2007 0.000024963290449 ***
## Year2008 0.000000000000364 ***
## Year2009 < 0.0000000000000002 ***
## Year2010 < 0.0000000000000002 ***
## Year2011 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.119 on 6216 degrees of freedom
## Multiple R-squared: 0.5183, Adjusted R-squared: 0.5169
## F-statistic: 371.6 on 18 and 6216 DF, p-value: < 0.00000000000000022
Checking for multicollinearity using VIF (Variance Inflation Factor).
vif(mod.2)
## GVIF Df GVIF^(1/(2*Df))
## IDV 2.907495 1 1.705138
## Gender 1.017047 1 1.008488
## Vehicle_Cat 2.448380 2 1.250892
## Age 1.011660 5 1.001160
## Zone 1.008092 3 1.001344
## Year 1.521716 6 1.035606
sqrt(vif(mod.2)) > 2
## GVIF Df GVIF^(1/(2*Df))
## IDV FALSE FALSE FALSE
## Gender FALSE FALSE FALSE
## Vehicle_Cat FALSE FALSE FALSE
## Age FALSE TRUE FALSE
## Zone FALSE FALSE FALSE
## Year FALSE TRUE FALSE
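If any variable were flagged TRUE in a GVIF column, we would need to drop it. Note that the only TRUE values appear in the Df column for Age and Year: sqrt() was applied to the whole matrix, and sqrt(5) and sqrt(6) both exceed 2. These flags reflect the degrees of freedom of the two factors, not collinearity; the GVIF^(1/(2*Df)) values are all well below 2. Model #3 below was nevertheless fitted without Year and Age.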
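A sketch of a check restricted to the adjusted GVIF column is shown below. The matrix is rebuilt by hand from the vif(mod.2) output above so that the example runs without the fitted model; comparing GVIF^(1/(2*Df)) against 2 is one common rule of thumb and is an assumption here, not a threshold stated by the original analysis.

```r
# Values copied from the vif(mod.2) output above
gvif <- matrix(
  c(2.907495, 1, 1.705138,
    1.017047, 1, 1.008488,
    2.448380, 2, 1.250892,
    1.011660, 5, 1.001160,
    1.008092, 3, 1.001344,
    1.521716, 6, 1.035606),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("IDV", "Gender", "Vehicle_Cat", "Age", "Zone", "Year"),
    c("GVIF", "Df", "GVIF^(1/(2*Df))")
  )
)

# Check only the adjusted GVIF column, which is comparable to sqrt(VIF)
adj <- gvif[, "GVIF^(1/(2*Df))"]
flagged <- names(adj)[adj > 2]
print(flagged)  # empty: no predictor exceeds the threshold

# The TRUE values in the printed table come from the Df column
df_flag <- sqrt(gvif[, "Df"]) > 2
print(names(which(df_flag)))
```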
Model #3
Creating a model by dropping Age and Year
##
## Call:
## lm(formula = profit ~ IDV + Gender + Vehicle_Cat + Zone, data = ins_rm_ex_IDV_pro)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4266 -0.2533 -0.0762 0.2588 6.1029
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 80.7526245520 0.0911674122 885.762
## IDV 0.0000055037 0.0000001033 53.293
## GenderMale 0.0706957681 0.0328415068 2.153
## Vehicle_CatCC-medium sized -0.2070672376 0.0536704401 -3.858
## Vehicle_CatCC-small sized -0.2571246921 0.0617809097 -4.162
## ZoneNorth -0.0816975666 0.0451195235 -1.811
## ZoneSouth -0.0160514480 0.0506483566 -0.317
## ZoneWest -0.1178341334 0.0591955365 -1.991
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## IDV < 0.0000000000000002 ***
## GenderMale 0.031386 *
## Vehicle_CatCC-medium sized 0.000115 ***
## Vehicle_CatCC-small sized 0.000032 ***
## ZoneNorth 0.070237 .
## ZoneSouth 0.751315
## ZoneWest 0.046569 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.145 on 6227 degrees of freedom
## Multiple R-squared: 0.4954, Adjusted R-squared: 0.4948
## F-statistic: 873.2 on 7 and 6227 DF, p-value: < 0.00000000000000022
Adjusted R-squared: 49.48%, close to the first model. Zone is only weakly significant, so we drop it.
Model #4
Creating a model using only the variables that showed a relation in our testing, plus Gender
##
## Call:
## lm(formula = profit ~ IDV + Vehicle_Cat + Gender, data = ins_rm_ex_IDV_pro)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4422 -0.2508 -0.0791 0.2593 6.1466
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 80.6907726278 0.0827948160 974.587
## IDV 0.0000055018 0.0000001033 53.258
## Vehicle_CatCC-medium sized -0.2054602269 0.0536739749 -3.828
## Vehicle_CatCC-small sized -0.2565468917 0.0618012713 -4.151
## GenderMale 0.0725237512 0.0328280543 2.209
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## IDV < 0.0000000000000002 ***
## Vehicle_CatCC-medium sized 0.00013 ***
## Vehicle_CatCC-small sized 0.0000335 ***
## GenderMale 0.02720 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.145 on 6230 degrees of freedom
## Multiple R-squared: 0.4948, Adjusted R-squared: 0.4945
## F-statistic: 1525 on 4 and 6230 DF, p-value: < 0.00000000000000022
Adjusted R-squared: 49.45%; still no improvement.
Now Regression diagnostics on Model #4
## integer(0)
## 1031 3494 532 4025 2029 3933
## 0.003376116 0.003376724 0.003386327 0.003600780 0.003691551 0.003846753
##
## Call:
## lm(formula = profit ~ IDV + Vehicle_Cat + Gender, data = ins_rm_ex_IDV_pro_s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4426 -0.2508 -0.0790 0.2593 6.1471
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 80.6902889175 0.0828297571 974.170
## IDV 0.0000055023 0.0000001034 53.238
## Vehicle_CatCC-medium sized -0.2049044771 0.0537076679 -3.815
## Vehicle_CatCC-small sized -0.2563707587 0.0618365396 -4.146
## GenderMale 0.0723120275 0.0328399556 2.202
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## IDV < 0.0000000000000002 ***
## Vehicle_CatCC-medium sized 0.000137 ***
## Vehicle_CatCC-small sized 0.0000343 ***
## GenderMale 0.027705 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.145 on 6226 degrees of freedom
## Multiple R-squared: 0.4947, Adjusted R-squared: 0.4944
## F-statistic: 1524 on 4 and 6226 DF, p-value: < 0.00000000000000022
## 3471 661 4305 3127 3809 4503 3795 4296 7554 4613
## 0 0 0 0 0 0 0 0 0 0
Model #5
Creating a model after removing the outlying observations identified in the Model #4 diagnostics
##
## Call:
## lm(formula = profit ~ IDV + Gender + Vehicle_Cat, data = ins_rm_ex_IDV_pro_r)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4417 -0.2509 -0.0791 0.2600 6.1466
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 80.6914262928 0.0829007427 973.350
## IDV 0.0000055003 0.0000001034 53.171
## GenderMale 0.0725144261 0.0328781918 2.206
## Vehicle_CatCC-medium sized -0.2054117718 0.0537545919 -3.821
## Vehicle_CatCC-small sized -0.2569204624 0.0619026004 -4.150
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## IDV < 0.0000000000000002 ***
## GenderMale 0.027452 *
## Vehicle_CatCC-medium sized 0.000134 ***
## Vehicle_CatCC-small sized 0.0000336 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.146 on 6221 degrees of freedom
## Multiple R-squared: 0.4943, Adjusted R-squared: 0.494
## F-statistic: 1520 on 4 and 6221 DF, p-value: < 0.00000000000000022
Adjusted R-squared: 49.4%; no improvement over Model #4. We therefore go forward with Model #4, which explains at best 49.45% of the variance in profit.
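The adjusted R-squared values used to compare the models can be reproduced from the reported multiple R-squared, the sample size and the number of estimated slope coefficients. The sketch below covers Models #1 to #4, which share the same 6235 observations (residual degrees of freedom plus coefficients plus one); because the inputs are rounded, results agree with the printed values only to about four decimals.

```r
# Adjusted R-squared from R-squared, sample size n and p slope coefficients
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

models <- data.frame(
  model = c("Model #1", "Model #2", "Model #3", "Model #4"),
  r2    = c(0.4929, 0.5183, 0.4954, 0.4948),  # reported multiple R-squared
  p     = c(1, 18, 7, 4)                      # numerator df of the F-statistic
)

n <- 6235  # e.g. Model #1: 6233 residual df + 1 slope + 1 intercept
models$adj_r2 <- adj_r2(models$r2, n, models$p)
print(models)
```

Model #2 has the highest adjusted R-squared, and Model #4 gives up very little of it while using far fewer terms.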
## `geom_smooth()` using method = 'gam'
The plot above shows that IDV is linearly related to profit. The relation is most linear for large-sized vehicles; for medium-sized and small-sized vehicles it becomes linear only after a certain point.
The model has an R-squared value of 49.4%.
o Each unit increase in IDV increases profit by 0.0000055.
o Male drivers bring 0.07251 more profit than female drivers.
o Medium-sized vehicles bring 0.205411 less profit than large-sized vehicles.
o Small-sized vehicles bring 0.2569204 less profit than large-sized vehicles.
There are many other factors, such as the driver's experience, the driver's state of mind and the severity of the accident, which may further affect profitability and the model's R-squared value.
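To make the coefficient interpretation concrete, the fitted equation can be evaluated by hand. The sketch below uses the coefficients printed in the Model #5 summary, which are the values the bullets above quote; the function name `predict_profit` and the example policy are hypothetical.

```r
# Coefficients as printed in the Model #5 summary
coefs <- c(
  intercept = 80.6914262928,
  idv       = 0.0000055003,
  male      = 0.0725144261,
  medium    = -0.2054117718,
  small     = -0.2569204624
)

# Hypothetical helper: predicted profit for a policy.
# Baseline levels are Female and CC-large sized.
predict_profit <- function(idv, vehicle_cat, gender) {
  unname(
    coefs["intercept"] +
      coefs["idv"] * idv +
      coefs["male"] * (gender == "Male") +
      coefs["medium"] * (vehicle_cat == "CC-medium sized") +
      coefs["small"] * (vehicle_cat == "CC-small sized")
  )
}

# Example: male driver, medium-sized vehicle, IDV of 500000
example <- predict_profit(500000, "CC-medium sized", "Male")
print(round(example, 2))  # about 83.31
```

Raising IDV by 100000 adds about 0.55 to the predicted profit, which is larger than the gender or vehicle-category effects.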
As per our analysis based on the available dataset:
Profit does not depend significantly on the age or gender of the driver in the ANOVA tests (the regression shows only a small gender effect).
Profit does not depend on geographical zone.
Profit of the company increases with the cubic capacity of the vehicle.
We analyzed the dataset to identify the factors that impact the revenue and profit of the company. As per our analysis, factors such as the age of the driver, the gender of the driver and the geographical zone to which the driver belongs do not materially affect the profit of the company. There might be several reasons for these findings which we cannot explore due to limitations imposed by the data.
The personal traits of the driver and the traffic in a specific region affect the probability of accidents, which in turn impacts revenue. Experience, vision, state of mind while driving, alcohol consumption and medication taken while driving might all influence the probability of an accident.
Auto insurance companies in India do not consider the gender and age of the driver when calculating insurance premiums. Instead, the age of the vehicle and the history of accidents are used to set premiums. We recommend that the company focus on insuring vehicles with higher IDV.